In [1]:
using GLM
using Statistics
using DataFrames
using Plots
using Random
plotly()
Out[1]:
Linear regression is a type of statistical modeling. Modeling, as the name suggests, creates a model from data. The model can then take new data and use it to make predictions.
In linear regression, that prediction is a continuous numerical variable, termed the target variable or the dependent variable. One or more feature variables, also called independent variables, are used in the prediction. A model learns from the feature variable(s) to predict the target variable.
In the code below, we create two arrays. The first represents a feature variable and the second a target variable.
In [2]:
Random.seed!(1) # For reproducible results
X = rand(Float16, 50) # 50 random feature values in [0, 1)
y = X .* 10 .+ (2 * randn(50)); # Target: 10 times the feature plus Gaussian noise
In [3]:
plot(X, y, seriestype = :scatter, title = "Scatter plot", label = "data")
Out[3]:
One possible model is shown below (red line). The model is linear (a straight line). The equation for a straight line (from school algebra) is $y = m x + c$. Here, $m$ is the slope and $c$ is the $y$ intercept (the value of $y$ where $x = 0$). In statistics we typically use the symbols $m = \beta_1$ and $c = \beta_0$.
In [4]:
x_vals = collect(0:0.01:1) # x values from 0 to 1 in steps of 0.01
plot!(x_vals, 10 * x_vals, label = "model")
Out[4]:
The feature variable value is multiplied by $10$ ($m = \beta_{1} = 10$) and we add no intercept ($\beta_{0} = 0$). Since this gives us a prediction of the target variable and not the actual target variable, we use the notation $\hat{y}$, as in (1).

$$\hat{y} = \beta_{0} + \beta_{1} x \qquad (1)$$
Let's view the first $X$ variable value, $X_1$.
In [5]:
X[1]
Out[5]:
It is $0.1699$. Its target value is found in y[1].
In [6]:
y[1]
Out[6]:
Our model predicts $\hat{y}_{1} = 10 X_{1} = 10 \times 0.1699 = 1.699$. This is a bit off. The difference between a predicted target value and the actual target value is called the error or the residual, denoted by the symbol $\epsilon$. We calculate the residual for the first entry in the data to be ${\epsilon}_{1} = 2.617 - 1.699 \approx 0.918$. To get the actual value back, we must add an error value to each predicted value, as in (2).

$$y = \hat{y} + \epsilon \qquad (2)$$
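As a quick sketch, we can check this first residual directly in code; it should match the rounded values above.
In [ ]:
# Residual for the first observation: actual target minus predicted value
res_1 = y[1] - 10 * X[1]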
Below, we calculate the predicted values for all of our data and then the residuals.
In [7]:
y_pred = 10 * X # Predicted target values from our model
res = y - y_pred; # Residuals: actual minus predicted
We can create a histogram of all the residuals.
In [8]:
histogram(res, bins = 10, title = "Residuals", label = "residuals")
Out[8]:
Another way to visualize the residuals is to look at them as a scatter plot. Each residual is above or below the $y = 0$ line.
In [58]:
plot(collect(1:length(y)), res, seriestype = :scatter, title = "Residuals", label = "residuals")
Out[58]:
This model can now be used on new data. Given any feature variable value, the model will predict a target variable value. This is the red line in the model plot above.
To see how good our model is, we create a loss function.
A loss function determines the difference between a set of predicted values and the actual values for the target variable. In linear regression, we typically use the mean squared error (MSE), (3).

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_{i} - y_{i} \right)^{2} \qquad (3)$$
Writing out the expansion and the calculation for $\hat{y}$, we have (4).

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \beta_{0} + \beta_{1} x_{i} - y_{i} \right)^{2} \qquad (4)$$

Here $n$ is the sample size. We subtract each actual target value from its predicted value and square the difference. This ensures a positive error term for each sample. We add all these squared errors and divide by the sample size, giving us a mean squared error.
Let's look at the first three samples.
In [10]:
y_pred[1:3]
Out[10]:
In [11]:
y[1:3]
Out[11]:
In [12]:
X[1:3]
Out[12]:
The squared differences for these three samples, and the abbreviated calculation of their mean, are sketched in the code below.
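This is a quick sketch using the arrays defined above; the exact numbers depend on the random seed set earlier.
In [ ]:
# Squared differences for the first three samples
sq_diff = (y_pred[1:3] .- y[1:3]).^2
# Their mean: the MSE restricted to these three samples
sum(sq_diff) / 3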
We can create a function to do the calculation of the MSE for us.
In [13]:
function mse(hat_y, y)
    return sum((hat_y - y).^2) / length(y)
end
Out[13]:
In [27]:
mse_model = mse(y_pred, y)
mse_model
Out[27]:
We see an MSE of $3.97$.
Perhaps we can do a bit better. To explain how, we start with a deliberately weak baseline. This base model simply uses the mean of the target variable as the predicted value for every sample.
We create a new array of predicted values, based on the mean of the target variable.
In [15]:
mean(y)
Out[15]:
In [16]:
y_pred_mean = repeat([mean(y)], length(y));
Let's look at the MSE of this model.
In [26]:
mse_mean_model = mse(y_pred_mean, y)
mse_mean_model
Out[26]:
We can compare a model to the base model, as in (6).

$$\text{comparison} = \frac{\text{MSE}_{\text{base}} - \text{MSE}_{\text{model}}}{\text{MSE}_{\text{base}}} \qquad (6)$$
When we multiply the result by $100$ we express the comparison as a percentage. For any model that fits at least as well as the base model, this ratio lies between $0$% and $100$%. We can say that, according to the model, the feature variable explains the comparison percentage of the variance in the target variable. Alternatively, we can state that the model reduces the variance in the base model's prediction by the comparison percentage.
In our case the comparison is as shown below.
In [28]:
(mse_mean_model - mse_model) / mse_mean_model
Out[28]:
We can say that our model with $\beta_{0} = 0$ and $\beta_{1} = 10$ explains $67.7$% of the variance in the target variable.
Rather than guessing $\beta_{0}$ and $\beta_{1}$, we can learn them with gradient descent. Below, we first define a cost function: half the MSE, written in matrix form with the coefficients collected in a vector $\theta$ (the factor of a half simplifies the gradient).
In [30]:
function cost_function(X, y, θ)
    n = size(X)[1] # Sample size
    preds = X * θ # Matrix multiplication
    loss = preds - y # Calculate the loss vector
    cost = (1/(2n)) * (loss' * loss) # Half mean squared cost
    return cost
end
Out[30]:
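As a quick sketch, we can evaluate this cost at our hand-picked parameters $\beta_{0} = 0$ and $\beta_{1} = 10$; the design matrix below is constructed just for this check, and the result should be roughly half the MSE we calculated earlier.
In [ ]:
# Design matrix with a column of ones for the intercept
X_design = hcat(ones(length(y)), X)
# Cost at θ = [0, 10]: approximately mse_model / 2
cost_function(X_design, y, [0.0, 10.0])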
In [32]:
function linear_reg_grad_d(X, y, α, fit_intercept=true, n_iter=1000)
    m = length(y) # Sample size (number of rows)
    if fit_intercept
        constant = ones(m, 1)
        X = hcat(constant, X) # Prepend a column of ones for the intercept
    end
    n = size(X)[2] # Number of features (columns)
    θ = zeros(n) # Initial parameter vector
    𝐉 = zeros(n_iter) # Cost recorded at each iteration
    for iter in range(1, stop=n_iter)
        pred = X * θ
        𝐉[iter] = cost_function(X, y, θ)
        θ = θ - ((α/m) * X') * (pred - y) # Gradient descent update
    end
    return (θ, 𝐉)
end
Out[32]:
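Calling the function returns the learned coefficients together with the per-iteration costs. The sketch below uses a learning rate of $0.01$ and the default $1000$ iterations; the parameters may not be fully converged at that point, so more iterations or a larger learning rate might be needed to approach the OLS fit further below.
In [ ]:
# Fit by gradient descent: θ_hat[1] is the intercept, θ_hat[2] the slope
θ_hat, costs = linear_reg_grad_d(X, y, 0.01)
θ_hat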
In [48]:
LinRange(1, 10, 10)
Out[48]:
In [56]:
cost_values = linear_reg_grad_d(X, y, 0.01)[2]
time_step = LinRange(1, 1000, 1000);
In [57]:
scatter(time_step, cost_values, title = "Cost per iteration", label = "cost")
Out[57]:
In [18]:
df = DataFrame(Feature = X, Target = y)
first(df, 5)
Out[18]:
In [19]:
ols = lm(@formula(Target ~ Feature), df)
Out[19]:
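The fitted coefficients can also be extracted directly with the coef function; the first element is the intercept and the second the slope, which should be close to the $10$ we used to generate the data.
In [ ]:
# Intercept and slope estimated by ordinary least squares
coef(ols)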
In [20]:
y_pred_best = predict(ols);
In [21]:
mse(y_pred_best, y)
Out[21]:
In [22]:
(12.280217763711272 - 3.8521425072619944) / 12.280217763711272
Out[22]:
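The hard-coded numbers above are the base model MSE and the best model MSE. As a sketch, the same comparison can be written with the variables we already have; for an OLS model with an intercept it should match the $r^2$ value reported below.
In [ ]:
# Variance explained by the OLS model relative to the mean-only base model
(mse_mean_model - mse(y_pred_best, y)) / mse_mean_model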
In [23]:
r2(ols)
Out[23]: